This section is adopted from the module 1, section 1 of Open Nighttime Lights. Please refer to it for more details and a video introduction of remote sensing.
Remote sensing is the science of identifying, observing, collecting and measuring objects without coming into direct contact with them. This can be accomplished through many devices that carry sensors and capture the characteristics of Earth remotely. Sensors on board satellites also record the electromagnetic energy that is reflected or emitted from objects on Earth. There are two types of sensors:
In this exercise, we will primarily work on the Vegetation Continuous Fields (VCF) provided by. The MOD44B Version 6 VCF is an annual data from 2000 to 2020 at 250 m resolution. Each pixel stores percentage of ground cover components. Three components of ground cover: percent tree cover, percent non-tree vegetation and percent bare, are included in separate layers.
The percents of ground cover are estimates from a machine learning model based on the combination of the Moderate Resolution Imaging Spectroradiometer (MODIS) data and other high resolution data from NASA and Google Earth. The machine learning model incorporates not only the visible bandwidth but other bandwidth such as brightness temperature (from MODIS bands 20, 31, 32).
The VCF data utilizes thermal signatures and other correlates to distinguish forest and non-forest plantation, which is an improvement compared to the Normalized Differenced Vegetation Index (NDVI). The Global Forest Cover (GFC) data set also describes deforestation but in a binary fashion. It records baseline forest cover in the year 2000 and includes a binary indicator for the year of deforestation for each 30m × 30m pixel. If over 90% of the forest cover was lost in a pixel by a given year, then the pixel will be marked as deforested, or reforested if the forest cover went from 0 in 2000 to positive. The VCF provide continuous percentage changes in percents of ground components cover which provides more details than the GFC data.
In geospatial data analysis, data can be classified into two categories: raster and vector data. A graphic comparison between raster and vector data can be found in this page.
In this tutorial, we will use both vector data and raster data. Geospatial data in vector format are often stored in a shapefile format. Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile. Raster data, on the other hand, is stored in Tagged Image File Format (TIFF or TIF). A GeoTIFF is a TIFF file that follows a specific standard for structuring meta-data. The meta-data stored in a TIFF is called a tif tag and GeoTIFFs often contain tags including: spatial extent, coordinate reference system, resolution, and number of layers. We will see examples in section 2.2.
More information and examples can be found in section 3 & 4 of the Earth Analytics Course.
To get started with R, we provide instructions on how to download and install R on your end. R is an open source software, which means users like you can also inspect, modify and improve its source code.
The Comprehensive R Archive Network (CRAN) provides links to install R under different operating systems. RStudio page provides a brief guide for troubleshooting.
RStudio is an integrated development environment for R. We can think R provides the engine for running codes, and RStudio is a user friendly control panel to perform various tasks. RStudio facilitates R codes writing, debugging and provides tools for workspace management. RStudio can be downloaded from the RStudio IDE page.
There are numerous posts, tutorials, courses on the internet. Once you have installed R and RStudio, pick any of the following resource to get familiar with R:
In order to perform data manipulation, we need to attach packages. The first step is downloading R packages from CRAN. For this exercise, we are going to use package luna and to download data from MODIS, and use terra, tidyverse, raster and sf for data manipulation. To do this, in R or RStudio console, type the following code (you will need the remotes package in order to download luna):
install.packages(c("terra", "remotes", "tidyverse", "raster", "sf"))
remotes::install_github("rspatial/luna")
We follow this tutorial to get MODIS data by luna. For details of the terra package, please refer to the package manuscript and this tutorial. If you are not familiar with the tidyverse workflow, please refer to R for Data Science that we suggested in the previous section. We now attach these packages.
library(terra)
library(luna)
library(tidyverse)
library(raster)
library(sf)
Once packages have been attached, we can access VCF in R. This website in NASA has various tools to access MODIS data, but these tools are not ideal for downloading large number of files or keeping data up-to-date automatically.
We can check the list of data products on MODIS. Since luna can also access data from LANDSAT and SENTINEL platforms, we add "^MOD|^MYD|^MCD" to narrow our scope to MODIS data. The printed results below listed six products from MODIS.
MODIS.product = getProducts("^MOD|^MYD|^MCD")
head(MODIS.product)
## provider concept_id short_name version
## 806 LPDAAC_ECS C1000000120-LPDAAC_ECS MOD44B 051
## 1433 LPDAAC_ECS C1000000400-LPDAAC_ECS MCD43D10 006
## 1447 LPDAAC_ECS C1000000401-LPDAAC_ECS MCD43D33 006
## 1452 LPDAAC_ECS C1000000402-LPDAAC_ECS MCD43D45 006
## 1454 LPDAAC_ECS C1000000403-LPDAAC_ECS MCD43D26 006
## 1456 LPDAAC_ECS C1000000404-LPDAAC_ECS MCD43D49 006
The product name for VCF is MOD44B. We can use function productInfo to launch the information page of VCF.
productInfo("MOD44B")
We can query MODIS and only download subset of the data. We need to specify start and end dates, and area of interest (AOI). The date format is “yyyy-mm-dd”. Suppose here we want to subset data from 2010 to 2012.
start.date = "2010-01-01"
end.date = "2012-12-31"
In order to subset your area of interest, you need to provide a “map” to getModis(). This can be obtained from online databases. One example is from the global administrative area database (GADM). You can download map data directly from GADM like in this post. Or you can use R to obtain data. To do this, you need to install package geodata.
remotes::install_github("rspatial/geodata")
For example, we are interested in India and its administrative areas, we can download India and its administrative area boundaries by
india = geodata::gadm("India", level=2, path=".")
The boundary data is then downloaded to the path that you specified in the path argument. The downloaded data through gadm() will be in the PackedSpatVector class. If you want to convert it to other class (for example, sf class), you can first read it using readRDS(), then create SpatVector via vect() in the terra package, and finally convert it to a sf object.
india = readRDS("./data/gadm36_IND_2_pk.rds") %>% vect() %>% st_as_sf(india)
Notice that levels in GADM are defined as:
The map we download is at the district level. Assume our AOI is the Odisha state. aoi is in vector format and is stored in as a data frame in R. Each row of the data represents a county in Odisha state. The geospatial information for each county is stored in the last column geometry.
aoi = india %>% filter(NAME_1 == "Odisha")
head(aoi)
## Simple feature collection with 6 features and 13 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: 82.63184 ymin: 20.14274 xmax: 87.48266 ymax: 21.97341
## CRS: BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
## GID_0 NAME_0 GID_1 NAME_1 NL_NAME_1 GID_2 NAME_2
## 1 IND India IND.26_1 Odisha NA IND.26.1_1 Anugul
## 2 IND India IND.26_1 Odisha NA IND.26.2_1 Balangir
## 3 IND India IND.26_1 Odisha NA IND.26.3_1 Baleshwar
## 4 IND India IND.26_1 Odisha NA IND.26.4_1 Bargarh
## 5 IND India IND.26_1 Odisha NA IND.26.5_1 Bauda
## 6 IND India IND.26_1 Odisha NA IND.26.6_1 Bhadrak
## VARNAME_2 NL_NAME_2 TYPE_2 ENGTYPE_2 CC_2 HASC_2
## 1 NA NA District District NA IN.OR.AN
## 2 Balangir NA District District NA IN.OR.BL
## 3 Balasore, Baleswar NA District District NA IN.OR.BW
## 4 NA NA District District NA IN.OR.BR
## 5 NA NA District District NA IN.OR.BD
## 6 NA NA District District NA IN.OR.BH
## geometry
## 1 POLYGON ((84.74501 20.60569...
## 2 POLYGON ((82.81747 20.16052...
## 3 MULTIPOLYGON (((87.12389 21...
## 4 POLYGON ((83.29642 20.84884...
## 5 POLYGON ((84.78773 20.57517...
## 6 MULTIPOLYGON (((87.01651 20...
ggplot(data = aoi) +
geom_sf()
We now have time range and AOI ready, we can check what MODIS has for VCF.
vcf.files = getModis("MOD44B", start.date, end.date, aoi, download = F)
head(vcf.files)
## [1] "MOD44B.A2009065.h26v07.006.2017081034835.hdf"
## [2] "MOD44B.A2009065.h25v07.006.2017081034600.hdf"
## [3] "MOD44B.A2009065.h26v06.006.2017081034818.hdf"
## [4] "MOD44B.A2009065.h25v06.006.2017081034537.hdf"
## [5] "MOD44B.A2010065.h26v07.006.2017081045754.hdf"
## [6] "MOD44B.A2010065.h25v07.006.2017081045519.hdf"
The products we are going to download are tiled products. For details of tiled product and the tilling system, please refer to MODIS overview page here. We will basically download grids of maps that cover our AOI. The naming convention is also explained in the overview page.
To download these files, you can use the following code,
getModis("MOD44B", start.date, end.date, aoi, download = T, path = YourPathHere, username = YourNASAUserName, password = YourPassWord)
You will need user name and password to interact with the NASA server. Please register on NASA Earth Data if you haven’t done so.
The data format from MODIS is HDF that may include sub-datasets. We can use terra to read these files and create raster files. For example,
hdf.example = rast("./data/VCFexample/MOD44B.A2009065.h25v06.006.2017081034537.hdf")
hdf.example
## class : SpatRaster
## dimensions : 4800, 4800, 7 (nrow, ncol, nlyr)
## resolution : 231.6564, 231.6564 (x, y)
## extent : 7783654, 8895604, 2223901, 3335852 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
## sources : MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_Tree_Cover
## MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_NonTree_Vegetation
## MOD44B.A2009065.h25v06.006.2017081034537.hdf:MOD44B_250m_GRID:Percent_NonVegetated
## ... and 4 more source(s)
## names : MOD44~Cover, MOD44~ation, MOD44~tated, MOD44~ality, MOD44~er_SD, MOD44~ed_SD, ...
We can find basic information such as coordinate reference system, number of cells, resolution from the above output. There are 7 layers in each of the VCF tiled file. We are interested in the percent tree coverage.
names(hdf.example)
## [1] "MOD44B_250m_GRID:Percent_Tree_Cover"
## [2] "MOD44B_250m_GRID:Percent_NonTree_Vegetation"
## [3] "MOD44B_250m_GRID:Percent_NonVegetated"
## [4] "MOD44B_250m_GRID:Quality"
## [5] "MOD44B_250m_GRID:Percent_Tree_Cover_SD"
## [6] "MOD44B_250m_GRID:Percent_NonVegetated_SD"
## [7] "MOD44B_250m_GRID:Cloud"
A quick plot of the data can be done with the plotRBG() function. We will dive into data processing and basic operations in the next section.
plotRGB(hdf.example, stretch = "lin")
In this section, we will follow this tutorial to process data that we have downloaded from MODIS. Note that terra is a relatively new package, although it is maintained frequently, a lot of other packages have not yet adopted terra objects (i.e. SpatRaster). Hence, we sometimes will use raster package to process data.
In previous section, we know that VCF data has 7 layers. If we directly convert SpatRaster to RasterLayer, it will only convert the first layer by default. For discussion of different class of raster, please see section 4.1 of the R as GIS for Economist page listed above.
Since there are four hdf files in each year for our AOI, let’s first merge four SpatRaster files into one in a year, say 2010. We only read the first layer (percent of tree cover) of each hdf file, which can be done by subset the output using [[1]].
# getting file names and directories
vcf.files.2010 = paste0("./data/VCFexample/", vcf.files[grep("A2010065", vcf.files)])
# read hdf files as SpatRaster
vcf.raster.2010 = lapply(vcf.files.2010, function(x) rast(x)[[1]])
Before we merge these SpatRster, it is often a good practice to check origin and resolution of them. merge requires origin and resolution to be the same across objects.
lapply(vcf.raster.2010, res)
## [[1]]
## [1] 231.6564 231.6564
##
## [[2]]
## [1] 231.6564 231.6564
##
## [[3]]
## [1] 231.6564 231.6564
##
## [[4]]
## [1] 231.6564 231.6564
lapply(vcf.raster.2010, origin)
## [[1]]
## [1] -2.996996e-06 1.000240e-06
##
## [[2]]
## [1] 5.004928e-06 1.000240e-06
##
## [[3]]
## [1] -2.996996e-06 -1.000240e-06
##
## [[4]]
## [1] 5.004928e-06 -1.000240e-06
Note that origins of these files are slightly different but all close to (0, 0). We do not need to worry about it, as merge will handle it automatically.
vcf.raster.2010 = do.call(merge, vcf.raster.2010)
plot(vcf.raster.2010)
We are now ready to crop and mask the raster file given our AOI. This tutorial explains the difference between cropping and masking.
To crop a raster file given a vector data, first match the coordinate reference systems of raster file and vector file. Then use crop(raster data, vector data). To mask, use mask(raster data, vector data). Note that for terra::mask(), the second argument need to be SpatVector. terra does not support sf objects yet.
# align coordinate reference systems
aoi = aoi %>% st_transform(crs = crs(vcf.raster.2010))
# crop raster data
vcf.raster.2010.aoi = terra::crop(vcf.raster.2010, aoi)
# mask raster data
vcf.raster.2010.aoi = mask(vcf.raster.2010.aoi, vect(aoi))
To plot your new raster file with boundaries, we do the following
plot(vcf.raster.2010.aoi)
plot(st_geometry(aoi), add = TRUE)
After we have crop and mask the raster file to our AOI, we can extract values for each county in the state of Odisha. For instructions of extracting values for points, please refer to section 5.2 in R as GIS for Economist.
# extract values for each county
aoi.county.vcf = terra::extract(vcf.raster.2010.aoi, vect(aoi))
colnames(aoi.county.vcf) = c("ID", "Percent Tree Cover")
head(aoi.county.vcf)
## ID Percent Tree Cover
## 1 1 25
## 2 1 15
## 3 1 16
## 4 1 4
## 5 1 4
## 6 1 3
The values extracted by terra::extract here is stored as a data frame. Note that the ID corresponds to row number of your vector file (i.e. object aoi in our case). We can then compute statistics based on this data frame. Here I compute several statistics of percent of forest cover, such as mean, median, max, min, and percent of positive forest covered land, for each county. Note that cells with 200% represent water and river and should be excluded from calculation.
aoi.summary = aoi.county.vcf %>% filter(`Percent Tree Cover` <= 100) %>%
group_by(ID) %>%
summarise(Mean = mean(`Percent Tree Cover`),
Median = median(`Percent Tree Cover`),
Max = max(`Percent Tree Cover`),
Min = min(`Percent Tree Cover`),
`Positive Percent` = sum(`Percent Tree Cover` > 0)/length(`Percent Tree Cover`) * 100)
aoi.summary = aoi.summary %>% mutate(ID = aoi$NAME_2) %>% rename(County = ID)
knitr::kable(aoi.summary, digits = 2)
| County | Mean | Median | Max | Min | Positive Percent |
|---|---|---|---|---|---|
| Anugul | 14.71 | 9 | 70 | 0 | 99.62 |
| Balangir | 6.16 | 4 | 64 | 0 | 99.92 |
| Baleshwar | 9.98 | 6 | 73 | 0 | 99.98 |
| Bargarh | 5.90 | 3 | 64 | 0 | 99.97 |
| Bauda | 14.61 | 8 | 67 | 0 | 99.52 |
| Bhadrak | 5.71 | 4 | 65 | 0 | 99.88 |
| Cuttack | 12.03 | 8 | 66 | 0 | 95.21 |
| Debagarh | 12.14 | 5 | 62 | 0 | 99.99 |
| Dhenkanal | 13.82 | 8 | 69 | 0 | 99.24 |
| Gajapati | 24.69 | 25 | 70 | 1 | 100.00 |
| Ganjam | 19.17 | 13 | 71 | 0 | 100.00 |
| Jagatsinghapur | 9.44 | 9 | 68 | 0 | 99.60 |
| Jajapur | 7.17 | 4 | 62 | 0 | 99.11 |
| Jharsuguda | 5.46 | 4 | 48 | 0 | 99.56 |
| Kalahandi | 13.82 | 8 | 68 | 0 | 99.96 |
| Kandhamal | 27.86 | 28 | 68 | 1 | 100.00 |
| Kendrapara | 9.99 | 8 | 73 | 0 | 99.98 |
| Kendujhar | 11.30 | 4 | 67 | 0 | 99.89 |
| Khordha | 13.42 | 9 | 66 | 0 | 99.99 |
| Koraput | 11.08 | 5 | 68 | 0 | 99.97 |
| Malkangiri | 13.15 | 8 | 69 | 0 | 99.97 |
| Mayurbhanj | 16.12 | 5 | 73 | 0 | 100.00 |
| Nabarangapur | 4.60 | 2 | 61 | 0 | 99.96 |
| Nayagarh | 23.48 | 21 | 70 | 0 | 99.84 |
| Nuapada | 6.98 | 4 | 41 | 0 | 99.97 |
| Puri | 12.88 | 11 | 68 | 0 | 99.97 |
| Rayagada | 19.80 | 20 | 69 | 1 | 100.00 |
| Sambalpur | 12.48 | 7 | 63 | 0 | 99.97 |
| Subarnapur | 6.51 | 4 | 39 | 0 | 99.20 |
| Sundargarh | 9.80 | 4 | 66 | 0 | 99.79 |
With terra you can easily write shape files and several formats of raster files. Main function for writing vector data is writeVector() and for writing raster data is writeRaster(). For details, you can refer to this page and the manuscript of terra.
In this section, we will replicate some main results in the paper The Ecological Impact of Transportation Infrastructure by Sam Asher, Teevrat Garg and Paul Novosad in 2020. To access the full replication data and codes, check this github repo. We are going to replicate Table 3 in the paper.
The research question is whether new constructed rural roads have impacts on local deforestation. Authors used two empirical strategies: fuzzy RD and difference-in-difference. In the following sections, we implement the difference-in-difference method and replicate regression results.
In order to run fixed effects models, we will need the fixest package. If you have not download it, you can type install.packages("fixest") in the R or RStudio console.
library(fixest)
This tutorial is a good reference for introducing fixest functions.
Data for this exercise was processed and stored in pmgsy_trees_rural_panel.csv. Each row of the data frame presents a village in a specific year.
rural.data = data.table::fread("./data/pmgsy_trees_rural_panel.csv")
The paper estimated the following equation: \[ Forest_{vdt} = \beta_1\cdot Award_{vdt} + \beta_2\cdot Complete_{vdt} + \alpha_v + \gamma_{dt} + X_v\cdot V_t + \eta_{vdt} \] where \(Forest_{vdt}\) is forest cover of village \(v\) in district \(d\) in year \(t\). \(Award_{vdt}\) is a dummy variable takes one during the period when the new road is awarded to the village but has not been built. \(Complete_{vdt}\) is also a dummy variable takes one for all years following the completion of a new road to village \(v\). \(\alpha_v\) is village fixed effects, while \(\gamma_{dt}\) is the district-year fixed effects. \(X_v\) controls some baseline characteristics (e.g. forest cover in 2000, total population), and is interacted with year fixed effects \(V_t\). For more details regarding the model specification, please refer to section 3.3 of the paper.
The syntax is straightforward, we allow baseline forest cover and total population have varying slopes across years. Note that the paper clustered standard error at the village level.
There is one more step before we run regressions. In Stata, reghdfe removed singleton groups automatically, but the fixest package has a problem with this (hopefully will be resolved soon). For now, we manually remove these observations.
# detect singleton groups: check village fixed effects and district-year fixed effects
index = lapply(list(rural.data$svgroup, rural.data$sdygroup), function(x) x[!(x %in% x[duplicated(x)])])
# how many observations need to be dropped
lapply(index, function(x) length(x))
## [[1]]
## [1] 0
##
## [[2]]
## [1] 210
# exclude singleton groups
rural.data = rural.data %>% filter(!(sdygroup %in% index[[2]]))
The results align with what presented in table 3. The new constructed roads had zero impact on local deforestation.
# Table 3
# Column (1)
log.forest.main = feols(ln_forest ~ award_only + treatment_comp|svgroup + sdygroup + year[ln_forest_2000, pc01_pca_tot_p],
cluster = "svgroup",
data = rural.data)
# Column (2)
log.forest.test = feols(ln_forest ~ treatment_comp|svgroup + sdygroup + year[ln_forest_2000, pc01_pca_tot_p],
cluster = "svgroup",
data = rural.data)
# Column (3)
avg.forest.main = feols(avg_forest ~ award_only + treatment_comp|svgroup + sdygroup + year[ln_forest_2000, pc01_pca_tot_p],
cluster = "svgroup",
data = rural.data)
# Column (4)
avg.forest.test = feols(avg_forest ~ treatment_comp|svgroup + sdygroup + year[avg_forest_2000, pc01_pca_tot_p],
cluster = "svgroup",
data = rural.data)
etable(log.forest.main, log.forest.test, avg.forest.main, avg.forest.test,
signifCode = c("***"=0.01,"**"=0.05,"*"=0.10),
drop.section = "slopes")
## log.forest.main log.forest.test avg.forest.main
## Dependent Var.: ln_forest ln_forest avg_forest
##
## award_only -0.0053*** (0.0017) -0.0334*** (0.0129)
## treatment_comp 0.0017 (0.0022) 0.0052*** (0.0017) 0.0087 (0.0152)
## Fixed-Effects: ------------------- ------------------ -------------------
## svgroup Yes Yes Yes
## sdygroup Yes Yes Yes
## year Yes Yes Yes
## _______________ ___________________ __________________ ___________________
## S.E.: Clustered by: svgroup by: svgroup by: svgroup
## Observations 688,275 688,275 688,275
## R2 0.94260 0.94260 0.92035
## Within R2 3.14e-5 1.65e-5 2.27e-5
## avg.forest.test
## Dependent Var.: avg_forest
##
## award_only
## treatment_comp 0.0133 (0.0123)
## Fixed-Effects: ---------------
## svgroup Yes
## sdygroup Yes
## year Yes
## _______________ _______________
## S.E.: Clustered by: svgroup
## Observations 688,275
## R2 0.92184
## Within R2 2.16e-6